ABSTRACT


A name recognition system (FIG. 1) is used to provide access to a database based on the voice recognition of a proper name spoken by a person who may not know the correct pronunciation of the name. During an enrollment phase (10), for each name-text entered (11) into a text database (12), text-derived recognition models (22) are created for each of a selected number of pronunciations of a name-text, with each recognition model being constructed from a respective sequence of phonetic features (15) generated by a Boltzmann machine (13). During a name recognition phase (20), the spoken input (24, 25) of a name (by a person who may not know the correct pronunciation) is compared (26) with the recognition models (22) looking for a pattern match; selection of a corresponding name-text is made based on a decision rule (28).


16 Claims, 3 Drawing Sheets 
VOICE RECOGNITION OF PROPER NAMES USING TEXT-DERIVED RECOGNITION MODELS

TECHNICAL FIELD OF THE INVENTION

The invention relates generally to voice recognition, and more particularly relates to a method and system for computerized voice recognition of a proper name using text-derived recognition models. In even greater particularity, for each name in a database, a separate recognition model is created for each of a selected number of pronunciations, with each recognition model being constructed from a respective sequence of phonetic features generated by a Boltzmann machine.

BACKGROUND OF THE INVENTION

Computerized voice recognition systems designed to recognize designated speech sequences (words and/or numbers) generally include two aspects: modeling and recognition. Modeling involves creating a recognition model for a designated speech sequence, generally using an enrollment procedure in which a speaker enrolls a given speech sequence to create an acoustic reference. Recognition involves comparing an input speech signal with stored recognition models looking for a pattern match.

Without limiting the scope of the invention, this background information is provided in the context of a specific problem to which the invention has applicability: a voice recognition system capable of accessing database records using the voice input of associated proper names. Such a system should accommodate a reasonable number of alternative pronunciations of each name.

A voice recognition system capable of recognizing names would be useful in many data entry applications. One such application is in the medical field in which patient records are routinely organized and accessed by both name and patient number.

For health care providers, using patient numbers to access patient records is problematic due to the impracticality of remembering such numbers for any significant class of patients. Thus, name recognition is a vital step in transforming medical record access from keyboard input to voice input.

Permitting name-based access to patient records via computerized voice recognition involves a number of problems. For such a system to be practical, both recognition model creation and name recognition would have to be speaker-independent. That is, speaker-independent recognition would be required because the identity of the users would be unknown, while model generation would have to be speaker-independent because a user would not necessarily know how to pronounce a patient's name.

Current systems designed to generate name pronunciations from text are typically an adaptation of text-to-speech technology, using extensive rule sets to develop a single proper pronunciation for a name based on the text of the name. Current systems designed to perform name recognition typically require users to input the correct pronunciation of the name, for example, by pronouncing the name.

These systems are designed to produce a single correct pronunciation of the name. Name recognition then requires the user to input the name using the nominal pronunciation, i.e., these name recognition systems are not designed to recognize alternative pronunciations of the same name, however reasonable such pronunciations may be.

Accordingly, a need exists for a computerized name recognition system for use in accessing name-associative records in a database, such as a medical records database. The system should be speaker-independent in that it would recognize names spoken by unknown users, where the user might not know the correct pronunciation of the name.

SUMMARY OF THE INVENTION

The invention is a name recognition technique using text-derived recognition models in recognizing the spoken rendition of name-texts (i.e., names in textual form) that are susceptible to multiple pronunciations, where the spoken name input (i.e., the spoken rendition of a name-text) is from a person who does not necessarily know the proper pronunciation of the name-text. Thus, the system generates alternative recognition models from the name-text corresponding to a reasonable number of pronunciations of the name.

In one aspect of the invention, the name recognition technique involves: (a) entering name-text into a text database which is accessed by designating name-text, (b) for each name-text in the text database, constructing a selected number of text-derived recognition models from the name-text, each text-derived recognition model representing at least one pronunciation of the name, (c) for each attempted access to the text database by a spoken name input, comparing the spoken name input with the stored text-derived recognition models. If such comparison yields a sufficiently close pattern match to one of the text-derived recognition models based on a decision rule, the name recognition system provides a name recognition response designating the name-text associated with such text-derived recognition model.

In an exemplary embodiment of the invention, the recognition models associated with alternative name pronunciations are generated automatically from the name-text using an appropriately trained Boltzmann machine. To obtain the alternative pronunciations, each name-text is input to the Boltzmann machine a selected number of times (such as ten), with the machine being placed in a random state prior to each input. For each input, the Boltzmann machine generates a sequence of phonetic features, each representing at least one pronunciation of the name-text (and each of which may be different).

When the input cycles for a name-text are complete, the phonetic-feature sequences that are different are used to construct a corresponding number of recognition models using conventional Hidden Markov Modeling (HMM) techniques; the recognition models are based on phonetic models derived from a comprehensive speech database providing good acoustic-phonetic coverage. The HMM recognition models are then stored for use during name recognition operations.

Name recognition is performed using conventional HMM recognition techniques. In particular, for each spoken name input, the HMM recognition procedure assigns scores to text-derived recognition models (representing the likelihood that the spoken name input is an instance of that recognition model), and evaluates name scores using a decision rule that selects either: (a) a single name, (b) a set of N rank-ordered names, or (c) no name (a rejection).
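The repeated-input step described in this summary (each name-text is input a selected number of times, such as ten, and only the sequences that differ are kept) can be illustrated with a minimal Python sketch. It is not the patented implementation: `toy_sampler` and `TOY_VARIANTS` are hypothetical stand-ins for a trained Boltzmann machine, and the phone-like symbols they return are placeholders for phonetic feature sequences.

```python
import random

# Hypothetical stand-in data: a real system would sample phonetic feature
# sequences from a trained Boltzmann machine, not from a lookup table.
TOY_VARIANTS = {
    "koch": [("k", "o", "k"), ("k", "o", "ch"), ("k", "oo", "k")],
}


def toy_sampler(name_text, rng):
    """Return one randomly chosen pronunciation sequence for name_text."""
    return rng.choice(TOY_VARIANTS.get(name_text, [tuple(name_text)]))


def distinct_pronunciations(name_text, sample_fn, cycles=10, seed=None):
    """Call sample_fn `cycles` times and keep distinct outputs in first-seen order."""
    rng = random.Random(seed)  # fresh random state for this name-text
    distinct = []
    for _ in range(cycles):
        sequence = tuple(sample_fn(name_text, rng))
        if sequence not in distinct:
            distinct.append(sequence)
    return distinct
```

Each distinct sequence returned by `distinct_pronunciations` would then be handed to the HMM recognition model generator. A name with only one plausible pronunciation yields a single sequence, however many cycles are run.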
Prior to pronunciation generation operations, the Boltzmann machine is appropriately trained using a conventional training (or learning) algorithm and a training database. The recommended approach uses a training database containing around 10,000 names; performance improves as the training set size increases. No additional training is performed when using the Boltzmann machine to generate pronunciations for particular names or name sets.

The total number of names stored in the application database is a performance issue; for typical applications, the recommended approach is to limit the active database size to about 500 names.

The technical advantages of the invention include the following. The only input required by the name recognition system is text; it does not require speech input or other representation of correct pronunciation (such as phonetic transcription). The system generates alternative recognition models, corresponding to different pronunciations, thus allowing recognition of alternative reasonable pronunciations, not just the correct pronunciation.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the invention, and for further features and advantages, reference is now made to the following Detailed Description of an exemplary embodiment of the invention, taken in conjunction with the accompanying Drawings, in which:

FIG. 1 illustrates the name recognition system, including enrollment and name recognition;

FIG. 2 illustrates a Boltzmann machine used to generate alternative name pronunciation representations from text input;

FIGS. 3a-3d illustrate the HMM model construction operation; and

FIG. 4 illustrates the higher-level text-derived recognition model used in the name recognition process.

DETAILED DESCRIPTION OF THE INVENTION

The Detailed Description of an exemplary embodiment of the speaker-independent name recognition system is organized as follows:

1. Name Recognition Technique
2. Recognition Model Creation
2.1. Phonetic Feature Sequence Generation
2.2. HMM Model Construction
3. Name Recognition
3.1. HMM Recognition
3.2. Decision Rule
4. Conclusion

The exemplary name recognition system is used to access the records of a name-associative database, such as a medical records database, in which each record has associated with it a proper name.

1. Name Recognition Technique

The basic name recognition technique involves creating acoustic recognition models from the text of previously entered names prior to any name recognition operations. The exemplary name recognition technique creates the recognition models using a Boltzmann machine to generate, from input name-text (i.e., the textual form of a name), phonetic feature sequences (or other pronunciation representations) that are used by an HMM (Hidden Markov Model) generator to construct HMM recognition models. For each name, alternative HMM recognition models are generated corresponding to alternative pronunciations.

FIG. 1 illustrates the exemplary name recognition technique. The name recognition system is used to provide access to a name-addressable database, such as a medical records database. It includes an enrollment phase (10) and a name recognition phase (20).

During the enrollment phase (10), name-texts are entered (11) into the name-addressable text database (12), such as during patient record creation, using normal keyboard data entry. For the exemplary medical records application, names are assumed to be available on an institution-wide medical records database, although a scenario can be envisioned where common names would be added or deleted by pre-storing many of the most common names.

For each name entered into the name-addressable text database, the name-text is repetitively input (during a selected number of input cycles) to an appropriately configured Boltzmann machine (13), which is reset to a random state prior to each input. For each name-text input, the Boltzmann machine generates a phonetic feature sequence; other pronunciation representations could be used.

The Boltzmann machine (13) will generate different phonetic feature sequences from the same input name-text, corresponding to different pronunciations of the name. When a name-text has been cycled through the Boltzmann machine the selected number of times (14), the resulting phonetic feature sequences are compared, and the different sequences, representing different pronunciations, are selected (15). Not every phonetic feature sequence will be different, and indeed, it is possible that none will be different (i.e., only a single pronunciation is represented).

The number of different sequences generated from a given number of inputs of the same name-text will depend [remainder of sentence illegible in source] which should insure that phonetic feature sequences will be generated for a high percentage of the reasonable pronunciations of any name.

The different phonetic feature sequences are input to an HMM recognition model generator (16). For each feature sequence, the HMM recognition model generator constructs a corresponding HMM recognition model using conventional HMM model generation techniques; the HMM recognition models are based on phonetic models derived from a speech database (18) providing good acoustic-phonetic coverage. The HMM recognition models are stored in an HMM recognition model database (22), to be used for name recognition operations.

Name recognition operations (20) are initiated by the spoken input (24) of a name, which is converted (25) into a corresponding speech signal.

The speech signal is input into an HMM recognition engine (26). Using conventional HMM recognition techniques, the HMM recognition engine accesses the HMM recognition model database (22), and compares the speech signal with the HMM recognition models looking for a pattern match. If such comparison yields a sufficiently close pattern match to one of the stored text-derived recognition models in terms of a decision rule (28), the HMM recognition engine provides a corresponding name recognition response designating the name-text associated with such recognition model.
The name recognition response is used to access the associated record in the medical database.

2. Recognition Model Creation

For the exemplary name recognition system, the recognition models associated with the different pronunciations for each name are created in a two step procedure that involves using a Boltzmann machine to generate phonetic feature sequences, and then using an HMM recognition model generator to construct the recognition models from the phonetic feature sequences.

2.1. Phonetic Feature Sequence Generation

A Boltzmann machine is a network that is trained to represent the probability density distribution of observables in a particular domain. It comprises simple interconnected units, at least some of which are external (input/output) units.

Each unit can be either "on" or "off"; the state of each unit (if not fixed) is a probabilistic function of the states of the units to which it is connected and the strength of the real-valued weights on the connections. All connection weights between units are symmetrical, representing mutually excitatory or mutually inhibitory relationships. Each configuration of the network has an energy value that is a function of the states and connection weights for all units.

The Boltzmann machine training algorithm is a procedure for gradually adjusting the weights on connections between units so that the network comes to model the domain of interest. It involves alternate cycles in which (a) the states of external units are either clamped (determined externally and held constant) or free (set randomly and allowed to change), while (b) all internal units are free.

For each initial configuration, a conventional simulated annealing procedure is used to bring the network to a state of equilibrium. Connection weights are adjusted to reduce the difference in energy between the clamped and free configurations.

Once trained, the network can perform pattern completion tasks probabilistically: a subset of its external units are set to values representing the input, and all other units are set randomly. Activations are propagated through the network, with the resulting states of the remaining external units representing the output. If the network has been trained successfully, the set of outputs produced for a given input represents the probability density of these outputs for the given input in the domain represented by the network.

The Boltzmann machine architecture and training algorithm is particularly useful in the context of the exemplary name recognition system because of the need to produce alternative outputs for the same input. The exemplary name recognition system uses a conventional Boltzmann machine architecture and training procedure.

FIG. 2 illustrates the exemplary approach to implementing the Boltzmann machine for generating phonetic feature sequences. The Boltzmann machine 30 includes two subnetworks 31 and 32, each comprising a respective set of input units 33 and 34, and a respective set of internal units (called a hidden layer) 35 and 36. The machine has a single set of output units 37.

The subnetwork 31 contains input units to scan a small number of letters (for example, three) at a time. The subnetwork 32 contains input units to scan a larger number of letters (for example, seven) at a time.

Both subnetworks 31 and 32 are sliding windows. That is, each input name is moved through both windows simultaneously, with each letter in turn placed in the central position of the input units. The output 37 represents the set of possible sound(s) corresponding to the center letter(s) in the string.

This exemplary approach to the configuration of the Boltzmann machine for generating phonetic feature sequences is based on two design criteria: (a) generally, a relatively small amount of contextual information will be sufficient to narrow the range of possible sound correspondences to a small set, but (b) choosing a correct sound from that set may require information occurring at more remote points in the name.

For each position in the windows represented by the input units 33 and 34, there are 26 or 27 input units corresponding to the letters of the alphabet plus "space" ("space" is omitted where unnecessary, i.e., in the central position). The unit corresponding to the current letter is on, while all other units are off. Each of these input units is connected to each unit in the corresponding hidden unit layer, and each of the hidden units is connected to each output unit.

The output 37 of the Boltzmann machine is a phonetic feature sequence: the output at each step represents a sequence of N features, where N is a small integer (possibly 1).

Alternatively, the output units could represent a sequence of N phones or phonemes. However, using phonetic features rather than other pronunciation representations such as phones or phonemes requires fewer output units, and allows greater flexibility.

The machine uses 2×M×N output units, where M is the number of phonetic features and N is the length of the output sequence. Each feature has a positive unit representing the presence of the feature and a negative unit representing the absence of the feature. A noncommittal or "neither" response (such as for a silent letter) is indicated by a zero value on both positive and negative units. In cases where N is greater than the length of the output sequence, adjacent sets of output units are assigned identical values during training.

For the exemplary Boltzmann machine, the subnetwork 31 contains 80 input units (a 3-letter window) and 416 hidden layer units. The subnetwork 32 contains 188 input units (a 7-letter window) and 64 hidden layer units. The output layer 37 contains 80 units (2 units each for 20 phonetic features, 2-phone sequence).

The connection weight values associated with the network are derived using the conventional Boltzmann machine training algorithm and a training database of names. The training database contains the spelling and all expected pronunciations of each name. In the exemplary system, pronunciations are represented as sequences of combinations of phonetic features, using the same feature set as in the Boltzmann machine (for example, +/- SYLLABIC, +/- VOICED, +/- NASAL, +/- LABIAL).

During the clamped cycles in the training procedure, each name in the training database is presented to the network in turn. The input units are clamped to the values corresponding to the letters in the name while the output units are clamped to values corresponding to one expected pronunciation.

Simulated annealing is used to bring the network to equilibrium, and its energy is then computed. This step is repeated for each expected pronunciation for each letter in each name.
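The unit counts given for the two sliding windows follow from the encoding described above: 27 units (26 letters plus "space") at every non-central window position and 26 units at the central position. The Python sketch below only illustrates that arithmetic. The function name `encode_window`, the lowercase a-z alphabet, and the padding of out-of-range positions with "space" are assumptions made for this sketch, not details taken from the description.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters; assumes names are lowercase a-z
SPACE = " "


def encode_window(name, center, width):
    """One-hot encode `width` letters of `name` centred on index `center`.

    Non-central positions use 27 units (26 letters plus space); the central
    position uses 26 units, since "space" is omitted there. Positions that
    fall outside the name are treated as space (an assumption of this sketch).
    """
    half = width // 2
    units = []
    for offset in range(-half, half + 1):
        pos = center + offset
        ch = name[pos] if 0 <= pos < len(name) else SPACE
        symbols = ALPHABET if offset == 0 else ALPHABET + SPACE
        units.extend(1 if ch == s else 0 for s in symbols)
    return units


# Output layer size: 2 units per feature, M = 20 features, N = 2 steps.
OUTPUT_UNITS = 2 * 20 * 2
```

With these assumptions, a 3-letter window yields 27 + 26 + 27 = 80 input units and a 7-letter window yields 6 × 27 + 26 = 188, matching the figures stated for subnetworks 31 and 32. The output layer size 2 × 20 × 2 = 80 matches the stated 80 output units.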
Network performance is tested by (a) clamping input units to values corresponding to letters in names, (b) setting all other units to random values, (c) allowing activations to propagate through the network, and (d) observing the values of the output units. Performance is tested at intervals and training is terminated when the performance of the network reaches an acceptable level.

By repeatedly inputting the same name into the Boltzmann machine, different phonetic feature sequences can be produced, corresponding to alternative plausible pronunciations of the name. Thus, the goal is not to determine a nominal or "correct" pronunciation, but to create a recognition model for any pronunciation likely to be used by a person reading the name.

2.2. HMM Recognition Model Construction

HMM text-derived recognition models are constructed by using the output of the Boltzmann machine to select from a library of phonetic unit models, and then combining these phonetic models into name recognition models.

The phonetic model library comprises HMM phonetic unit models representing phonetic units. Each phonetic unit model represents sets of expected acoustic features and durations for these features.

FIG. 3a illustrates an exemplary phonetic unit model based on cepstral feature analysis and a simple exponential duration model. Other types of models, such as finite duration, could be used.

The phonetic unit models are created and trained using conventional HMM model generation techniques, based on a speech database providing good coverage of the acoustic-phonetic features of speech for the context in which the name recognition system will be used. This speech database is distinct from the name database used in training the Boltzmann machine, which does not contain any speech material; the speech database typically does not include any of the names from the name training database.

The phonetic feature sequences generated by the Boltzmann machine are used to select corresponding phonetic unit models from the phonetic model library. That is, for each set of phonetic features observed in the Boltzmann machine output, the phonetic unit model representing the phonetic unit having that set of features is selected. Adjacent identical feature sets are collapsed to select a single phonetic unit model.

In cases where the Boltzmann machine phonetic feature sequence is consistent with more than one phonetic unit model, the recommended approach is to select all corresponding phonetic unit models. If the Boltzmann machine phonetic feature sequence is not consistent with any phonetic unit model, the recommended approach is to discard that output sequence. Alternatively, the most nearly consistent model(s) can be selected, using a heuristic procedure to determine degree of consistency. However, certain outputs (e.g., both positive and negative units off) are taken to represent "no phonetic model", i.e., the input letter is a "silent" letter.

The selected sequential phonetic units for each name are then sequentially combined into a recognition model representing the name.

FIG. 3b illustrates the exemplary concatenation approach to combining phonetic unit models. Other conjoining procedures, such as overlapping the final state of one model with the initial state of the next model, could be used.

FIG. 3c illustrates recognition model construction where more than one phonetic unit model is selected at a given point in the sequence. Branches are constructed in the recognition model allowing any one of the alternate phonetic unit models to occur at that point.

FIG. 3d illustrates an exemplary procedure for constructing a single text-derived recognition model representing alternate pronunciations. First, a text-derived recognition model is constructed for each distinct Boltzmann machine output sequence. These text-derived recognition models can be used separately, or two or more of them can be combined for greater efficiency and reduced memory; this combining procedure is [illegible in source]. Alternate text-derived recognition models representing alternative pronunciations can be combined into a single network by collapsing common paths (i.e., shared phonetic units at corresponding points in the phonetic unit sequence). In this case, the number of recognition models for a database of N names is simply N, but each recognition model represents M alternative pronunciations, where M is greater than or equal to 1, and where M may vary with the name.

3. Name Recognition

Name recognition is performed using conventional HMM recognition technology. Input to the system consists of a spoken name input (i.e., the spoken rendition of a name-text) from a person who has available the text of the name but who may or may not know the correct pronunciation.

This spoken name input is converted to a speech signal and recognized by a conventional HMM recognition procedure. The output of the HMM recognition procedure is then evaluated according to a decision rule. The output of the decision process is either one or more name-texts, or a rejection, i.e., a decision that no name-text was recognized.

3.1. HMM Recognition

The conventional HMM recognition procedure performs a pattern-matching operation, comparing the input speech signal to relevant HMM text-derived recognition models.

Referring to FIG. 1, for name recognition, the speech signal is compared to the HMM text-derived recognition models (22) constructed in advance from the phonetic unit models (18), which are based on predicted phonetic feature representations for that name-text. The HMM recognition models represent all predicted pronunciations of all name-texts in the text database (12). Thus, for a text database of N names, recognition is an N-way discrimination problem. Each of the N name-texts is represented by a set of M models, where M is the number of pronunciations separately modeled for each name; M is greater than or equal to 1, and typically varies across names.

FIG. 4 illustrates an exemplary higher-level text-derived recognition model used in the recognition process; this model can be generalized to names containing arbitrary numbers of components.

The recognition model is a finite state machine containing a sequence of states and transitions. Each transition between states represents one component of the name (e.g., first name, last name).
The recognition model also includes transitions allowing optional nonspeech (e.g., silence) initially, between names, and finally. These nonspeech transitions allow the speech to be recognized without performing a prior speech endpointing procedure to determine the endpoints of each component of the name. They also allow pauses of variable lengths, including no pause, between components of the name.
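The structure of this higher-level model can be illustrated with a small Python sketch of a finite state machine over name components. It is a simplification under stated assumptions: optional nonspeech is modelled as a self-loop labelled `<sil>` on every state, and the model consumes component labels rather than acoustic frames. The names `build_name_fsm`, `accepts`, and `SIL` are hypothetical and do not come from the description.

```python
SIL = "<sil>"  # hypothetical label for optional nonspeech (e.g., silence)


def build_name_fsm(components):
    """Build transitions (src, dst, label) for a name with the given components.

    State i means "i components recognized so far". A silence self-loop on
    every state allows pauses of any length (including none) initially,
    between components, and finally.
    """
    transitions = [(s, s, SIL) for s in range(len(components) + 1)]
    for state, component in enumerate(components):
        transitions.append((state, state + 1, component))
    return transitions


def accepts(transitions, n_components, labels):
    """Return True if the label sequence drives the machine to its final state."""
    state = 0
    for label in labels:
        targets = [dst for (src, dst, lab) in transitions
                   if src == state and lab == label]
        if not targets:
            return False
        state = targets[0]
    return state == n_components
```

For a two-component name such as `["john", "smith"]`, the machine accepts `["john", "smith"]` and `["<sil>", "john", "<sil>", "<sil>", "smith"]` alike, so no endpointing step is needed to strip the pauses first.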

3.2. Decision Rule

The HMM recognition procedure outputs one or more name scores, each representing the likelihood that the speech input is an instance of the corresponding text-derived recognition model. These name scores are then evaluated using a decision rule that selects either: (a) a single name-text, (b) a set of N rank-ordered name-texts, or (c) no name (a rejection).

In the exemplary HMM recognition procedure, the recognizer outputs name scores for all text-derived recognition models of all names. For names which have multiple recognition models representing alternative pronunciations, a single composite name score is derived from the separate name scores by selecting the single best name score.

The best (maximum likelihood) name score is then compared to the name score for the second most likely name. If the best name score exceeds a predetermined absolute score threshold, and if the difference between the best and second best name scores also exceeds a predetermined difference score threshold, then the corresponding name-text having the best name score is output as the recognized name. Otherwise, no name is output, i.e., the recognizer reports that no name was recognized.
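The exemplary decision rule can be written compactly. The Python sketch below is illustrative only: the function name `decide`, the dictionary layout, and the convention that a higher score means a more likely match are assumptions. The description does not say what happens when only one name is in the database; this sketch assumes the difference test is then skipped.

```python
def decide(model_scores, abs_threshold, diff_threshold):
    """Apply the two-threshold decision rule.

    model_scores maps each name-text to a list of scores, one per
    pronunciation model (higher = more likely; an assumption of this sketch).
    Returns the recognized name-text, or None for a rejection.
    """
    # Composite score per name: the single best score among its models.
    composite = {name: max(scores)
                 for name, scores in model_scores.items() if scores}
    ranked = sorted(composite.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_name, best_score = ranked[0]
    # The best score must exceed the absolute score threshold.
    if best_score <= abs_threshold:
        return None
    # The margin over the second best name must exceed the difference threshold.
    if len(ranked) > 1 and best_score - ranked[1][1] <= diff_threshold:
        return None
    return best_name
```

For example, with scores `{"smith": [-10.0, -4.0], "smyth": [-9.0]}`, an absolute threshold of -6.0 and a difference threshold of 3.0, the composite scores are -4.0 and -9.0, both tests pass, and `"smith"` is returned. Raising the difference threshold to 5.0 produces a rejection.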

Various alternate decision rules can be used. For example, in the simplest case, the recognition procedure outputs only one name score, i.e., the score for the single most likely text-derived recognition model, and a simple score threshold is used to accept or reject that name. Alternatively, the recognizer may output multiple name scores, and a rank-ordered list of the best N name-texts may be selected by exercising a decision rule for the most likely candidates. In this case, a second decision procedure employing independent information is used to select a single name-text from the list. For example, a single name-text may be selected by reference to other information stored in the application database (such as a patient's physician or diagnosis), or may be selected by the user.
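The N-best alternative can be sketched in the same style. In this Python sketch the record layout (a dictionary with a `physician` field) and the function names `n_best` and `pick_by_physician` are hypothetical; they stand in for whatever independent information the application database actually holds.

```python
def n_best(composite_scores, n):
    """Return the n name-texts with the highest composite scores, best first."""
    ranked = sorted(composite_scores.items(),
                    key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def pick_by_physician(candidates, records, physician):
    """Select the single candidate whose record lists the given physician.

    Returns None when zero or several candidates match, in which case the
    choice could be left to the user instead.
    """
    matches = [name for name in candidates
               if records.get(name, {}).get("physician") == physician]
    return matches[0] if len(matches) == 1 else None
```

A caller would first take `n_best(scores, 2)` and then resolve the short list with `pick_by_physician`, falling back to asking the user when it returns `None`.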

4. Conclusion 

Although the Detailed Description of the invention is directed to certain exemplary embodiments, various modifications of these exemplary embodiments, as well as alternative embodiments, will be suggested to those skilled in the art. For example, the invention has general applicability for name recognition systems where the text of names is input in advance and used to create alternative recognition models associated with alternative pronunciations.

It is to be understood that the invention encompasses any modifications or alternative embodiments that fall within the scope of the appended Claims.

What is claimed is:

1. A method of proper name recognition using text-derived recognition models to recognize spoken rendition of name-texts (i.e., names in textual form) that are susceptible to multiple pronunciations, where spoken name input (i.e., spoken rendition of a name-text) is from a person who does not necessarily know how to properly pronounce the name-text, comprising the steps:

entering name-text into a text database in which the 
database is accessed by designating name-text; 

for each name-text in the text database, constructing 
a selected number of text-derived recognition mod- 
els from the name-text, each text-derived recogni- 
tion model representing at least one pronunciation 
of the name; 

for each attempted access to the text database by a 
spoken name input, comparing the spoken name 
input with the text-derived recognition models; 
and 

if such comparison yields a sufficiently close pattern
match to one of the text-derived recognition mod- 
els based on a decision rule, providing a name rec- 
ognition response designating the name-text associ- 
ated with such text-derived recognition model. 

2. The name recognition method of claim 1, wherein 
the step of constructing a selected number of text- 
derived recognition models is accomplished using a 
neural network. 

3. The name recognition method of claim 1, wherein
the step of constructing a selected number of recogni- 
tion models comprises the substeps: 

for each name in the text database, inputting the 
name-text into an appropriately trained Boltzmann 
machine for a selected number of input cycles, with 
the machine being placed in a random state prior to 
each input cycle; 

for each input cycle, generating a corresponding 
pronunciation representation sequence of at least 
one pronunciation for the name-text; 

when the input cycles are complete, constructing 
from the pronunciation representation sequences 
that are different at least one text-derived recogni- 
tion model representing at least one pronunciation 
of the name-text. 
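
By way of illustration only, the input-cycle loop of claim 3 might be sketched as follows. The trained Boltzmann machine is not reproduced here; `toy_generator` and `TOY_LETTER_SOUNDS` are hypothetical stand-ins that simply draw one candidate sound per letter at random, and reseeding a generator before each cycle only loosely mirrors placing the machine in a random state.

```python
import random

# Hypothetical stand-in for a trained Boltzmann machine: each letter maps to
# one or more candidate phonetic symbols, and one is drawn at random.
TOY_LETTER_SOUNDS = {
    "l": ["l"],
    "e": ["eh", "iy"],
    "a": ["ae", "ey"],
    "h": ["h", ""],
}


def toy_generator(name_text, rng):
    """Return one pronunciation sequence (a tuple of symbols) for name_text."""
    symbols = []
    for letter in name_text.lower():
        choice = rng.choice(TOY_LETTER_SOUNDS.get(letter, [letter]))
        if choice:  # an empty string models a silent letter
            symbols.append(choice)
    return tuple(symbols)


def enroll(name_text, cycles, seed=0):
    """Run `cycles` input cycles and keep only the distinct sequences."""
    distinct = []
    for cycle in range(cycles):
        # Start every cycle from its own random state (assumed analogue of
        # randomizing the machine before each input cycle).
        rng = random.Random(seed + cycle)
        sequence = toy_generator(name_text, rng)
        if sequence not in distinct:
            distinct.append(sequence)
    return distinct
```

Each distinct sequence returned by `enroll` would then be a candidate for one text-derived recognition model.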

4. The method of proper name recognition using 
text-derived recognition models of claim 3, wherein the 
pronunciation representations are phonetic features. 

5. The method of proper name recognition using 
text-derived recognition models of claim 3, wherein 
said Boltzmann machine comprises: 

small and large sliding-window subnetworks, each 
including a respective set of input units and a respective
set of internal units; and

a set of output units; 

said small sliding window subnetwork being com- 
posed of a smaller number of input units than said 
large sliding window subnetwork; 

such that, for each input cycle, the step of generating 
a corresponding pronunciation representation se- 
quence is accomplished by moving the name-text 
through both windows simultaneously, with each 
letter in turn placed in a central position in the 
respective sets of input units. 
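
As a rough sketch of how a name-text might be moved through two windows at once, the following helper pads the name and centers each letter in turn. The window widths (3 and 7), the padding character, and the function names are assumptions chosen for illustration; the claim does not specify them.

```python
def sliding_windows(name_text, size, pad="_"):
    """Return one window per letter, with that letter in the central position."""
    if size % 2 == 0:
        raise ValueError("window size must be odd so that a center exists")
    half = size // 2
    padded = pad * half + name_text + pad * half
    return [padded[i:i + size] for i in range(len(name_text))]


def dual_windows(name_text, small=3, large=7):
    """Pair the small and large windows for each letter position."""
    return list(zip(sliding_windows(name_text, small),
                    sliding_windows(name_text, large)))
```

For the name-text "abc", the second pair produced is ("abc", "__abc__"), with "b" at the center of both windows.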

6. The method of proper name recognition using 
text-derived recognition models of claim 1, wherein the 
step of constructing a selected number of text-derived 
recognition models is accomplished using HMM model- 
ing. 

7. The method of proper name recognition using 
text-derived recognition models of claim 6, wherein the 
step of constructing text-derived recognition models 
comprises the substeps of: 
creating a phonetic model library of phonetic unit
models representing phonetic units, where each 
phonetic unit model represents sets of expected 
acoustic features and durations for such features; 

for each pronunciation representation sequence generated
by the Boltzmann machine, searching the
phonetic model library for corresponding phonetic 
unit models; and 

if at least one corresponding phonetic unit model is 
found, selecting such phonetic unit model;

otherwise, discarding such pronunciation representa- 
tion sequence; and 

after all pronunciation representation sequences have 
been used to search the phonetic model library, 
constructing a corresponding text-derived recognition
model using the selected phonetic unit models.
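
The library search of claim 7 could be sketched, under one possible reading, as a per-symbol lookup in which a sequence is discarded whenever any of its symbols lacks a phonetic unit model. That reading, the dictionary-based library, and the names below are assumptions for illustration only.

```python
def build_recognition_models(sequences, library):
    """Map each sequence to its phonetic unit models; drop unmatched sequences."""
    models = []
    for sequence in sequences:
        units = [library.get(symbol) for symbol in sequence]
        if any(unit is None for unit in units):
            continue  # discard: some symbol has no phonetic unit model
        models.append(units)
    return models


# Hypothetical library: each unit model records an expected duration in frames.
TOY_LIBRARY = {
    "l": {"unit": "l", "frames": 5},
    "iy": {"unit": "iy", "frames": 9},
    "ae": {"unit": "ae", "frames": 8},
}
```

Here `build_recognition_models([("l", "iy"), ("l", "zz")], TOY_LIBRARY)` keeps only the first sequence, because "zz" has no entry in the toy library.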

8. The method of proper name recognition using 
text-derived recognition models of claim 7, wherein the 
substep of selecting such phonetic unit model comprises 
the substep:

if at least one phonetic unit model is found that corresponds
to the pronunciation representation sequence
with a predetermined degree of consistency,
selecting such phonetic unit model.

9. The method of proper name recognition using text-derived
recognition models of claim 1, wherein the step
of providing a name recognition response designating
the name-text associated with such text-derived recognition
model is accomplished according to the decision-rule
substeps of:

for each spoken name input, assigning first scores to 
the text-derived recognition models representing a 
likelihood that the spoken name input is an instance 
of that recognition model; and

evaluating the first scores using a decision rule that 
selects as a name recognition response either: (a) a 
single name-text, (b) a set of N rank-ordered name- 
texts, or (c) no name. 

10. The method of proper name recognition using
text-derived recognition models of claim 9, wherein the 
step of assigning first scores comprises the substeps of: 

for each spoken name input, assigning name scores to
all text-derived recognition models for all name-texts
representing the likelihood that the spoken
name input is an instance of that recognition model;
and 

for names with multiple text-derived recognition 
models representing alternative pronunciations, 
assigning a single name score derived by selecting
a single best name score associated with a text- 
derived recognition model which is most likely. 

11. The method of proper name recognition using 
text-derived recognition models of claim 10, wherein 
the step of evaluating the first scores using a decision
rule is accomplished by: 

comparing the best name score to
the name score for a text-derived recognition
model which is second most likely; and

if the best name score exceeds a predetermined absolute
score threshold, and if the difference between
the best and second best name scores also exceeds
a predetermined difference score threshold, then
the name-text associated with the text-derived recognition
model having the best score is output as the
name recognition response;

otherwise, the name recognition response indicates 
no name-text. 
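
The decision rule of claims 10 and 11 can be illustrated with the minimal sketch below. It assumes that higher scores mean a more likely match and that the second-best score is taken over names after each name's alternative pronunciations have been collapsed to a single best score; the function name and argument layout are hypothetical.

```python
def decide(scores, abs_threshold, diff_threshold):
    """Pick a name-text or return None.

    scores maps each name-text to a list of scores, one per pronunciation
    model; higher is assumed to be better.
    """
    # One score per name: the best of its alternative pronunciations.
    best_per_name = {name: max(vals) for name, vals in scores.items() if vals}
    ranked = sorted(best_per_name.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    best_name, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else float("-inf")
    # Both the absolute test and the margin test must pass.
    if best > abs_threshold and best - second > diff_threshold:
        return best_name
    return None
```

With thresholds of 0.6 and 0.2, scores of 0.9 for one name and 0.5 for another select the first name, whereas scores of 0.9 and 0.85 yield no name because the margin is too small.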


12. A method of proper name recognition using text- 
derived recognition models to recognize spoken rendi- 
tion of name-texts (i.e., names in textual form) that are 
susceptible to multiple pronunciations, where spoken 
name input (i.e., the spoken rendition of a name-text) is 
from a person who does not necessarily know how to 
properly pronounce the name-text, comprising the 
steps: 

entering name-text into a text database in which the 
database is accessed by designating name-text; 

for each name-text in the text database, inputting the 
name-text into an appropriately trained Boltzmann 
machine for a selected number of input cycles, with 
the machine being placed in a random state prior to 
each input cycle; 

for each input cycle, generating a corresponding 
phonetic feature sequence of at least one pronunci- 
ation for the name-text; 

when the input cycles are complete, constructing 
from the phonetic feature sequences that are differ- 
ent at least one text-derived recognition model 
representing at least one pronunciation of the 
name-text; 

for each attempted access to the text database by a 
spoken name input, comparing the spoken name 
input with the stored text-derived recognition 
models; and 

if such comparison yields a sufficiently close pattern 
match to one of the text-derived recognition mod- 
els based on a decision rule, providing a name rec- 
ognition response designating the name-text associ- 
ated with such text-derived recognition model. 

13. The method of proper name recognition using 
text-derived recognition models of claim 12, wherein 
the step of constructing from the phonetic feature sequences
that are different at least one text-derived recognition
model comprises the substeps of:

creating a phonetic model library of phonetic unit 
models representing phonetic units, where each 
phonetic unit model represents sets of expected 
acoustic features and durations for such features; 

for each phonetic feature sequence generated by the 
Boltzmann machine, searching the phonetic model 
library for corresponding phonetic unit models; 
and 

if at least one corresponding phonetic unit model is 
found, selecting such phonetic unit model; 

otherwise, discarding such phonetic feature sequen- 
ces; and 

after all phonetic feature sequences have been used to 
search the phonetic model library, constructing a 
corresponding text-derived recognition model 
using the selected phonetic unit models. 

14. The method of proper name recognition using 
text-derived recognition models of claim 12, wherein 
the step of providing a name recognition response designating
the name-text associated with such text-derived
recognition model is accomplished according to the 
decision-rule substeps of: 

for each spoken name input, assigning first scores to 
the text-derived recognition models representing a 
likelihood that the spoken name input is an instance 
of that recognition model; and 

evaluating the first scores using a decision rule that 
selects as a name recognition response either: (a) a 
single name-text, (b) a set of N rank-ordered name- 
texts, or (c) no name. 
15. A proper name recognition system using text-derived
recognition models to recognize spoken rendition
of name-texts (i.e., names in textual form) that are
susceptible to multiple pronunciations, where spoken 
name input (i.e., the spoken rendition of a name-text) is 
from a person who does not necessarily know how to 
properly pronounce the name-text, comprising: 

a text database into which are entered name-texts, 
where the database is accessed by designating 
name-text; 

an appropriately trained Boltzmann machine respon- 
sive to the input of a name-text for generating a 
corresponding phonetic feature sequence of at least 
one pronunciation for the name-text; 

each name-text being input to said Boltzmann ma- 
chine a selected number of input cycles, with the 
machine being placed in a random state prior to 
each input cycle; 

a text-derived recognition model generator for constructing,
after the selected number of input cycles
for a name-text is complete, from the phonetic
feature sequences that are different at least one
text-derived recognition model representing at
least one pronunciation of the name-text;

a name-text recognition engine for comparing, for
each attempted access to the text database by a
spoken name input, such spoken name input with
the generated text-derived recognition models, and
if such comparison yields a sufficiently close pattern
match to one of the text-derived recognition
models based on a decision rule, providing a name
recognition response designating the name-text
associated with such text-derived recognition
model.

16. The proper name recognition system using text- 
derived recognition models of claim 15, further com- 
prising: 

a phonetic model library of phonetic unit models 
representing phonetic units, where each phonetic 
unit model represents sets of expected acoustic 
features and durations for such features; 

such that, for each phonetic feature sequence gener- 
ated by the Boltzmann machine, said text-derived 
recognition model generator searches the phonetic 
model library for corresponding phonetic unit 
models, and if at least one corresponding phonetic 
unit model is found, selects such phonetic unit 
model, otherwise, it discards such phonetic feature 
sequence; and 

after said text-derived recognition model generator 
has so processed all phonetic feature sequences, it 
constructs a corresponding text-derived recogni- 
tion model using the selected phonetic unit models. 
(57) ABSTRACT

A distributed speech processing system for constructing 
speech recognition reference models that are to be used by 
a speech recognizer in a small hardware device, such as a 
personal digital assistant or cellular telephone. The speech 
processing system includes a speech recognizer residing on 
a first computing device and a speech model server residing 
on a second computing device. The speech recognizer 
receives speech training data and processes it into an inter- 
mediate representation of the speech training data. The 
intermediate representation is then communicated to the 
speech model server. The speech model server generates a 
speech reference model by using the intermediate represen- 
tation of the speech training data and then communicates the 
speech reference model back to the first computing device 
for storage in a lexicon associated with the speech recog- 
nizer. 

32 Claims, 2 Drawing Sheets 
SPEECH RECOGNITION TRAINING FOR
SMALL HARDWARE DEVICES 

BACKGROUND AND SUMMARY OF THE 
INVENTION 

The present invention relates generally to speech recog- 
nition systems, and more particularly, the invention relates 
to a system for training a speech recognizer for use in a small 
hardware device. 

The marketing of consumer electronic products is very 
cost sensitive. Reduction of the fixed program memory size, 
the random access working memory size or the processor 
speed requirements results in lower cost, smaller and more 
energy efficient electronic devices. The current trend is to 
make these consumer products easier to use by incorporating 
speech technology. Many consumer electronic products, 
such as personal digital assistants (PDAs) and cellular
telephones, offer ideal opportunities to exploit speech
technology; however, they also present a challenge in that
memory and processing power are often limited within the
host hardware device. Considering the particular case of 
using speech recognition technology for voice dialing in 
cellular phones, the embedded speech recognizer will need 
to fit into a relatively small memory footprint. 

To economize memory usage, the typical embedded 
speech recognition system will have a very limited, often
static, vocabulary. In this case, condition-specific words,
such as names used for dialing a cellular phone, could not be 
recognized. In many instances, the training of the speech 
recognizer is more costly, in terms of memory required or 
computational complexity, than is the speech recognition 
process. Small low-cost hardware devices that are capable of 
performing speech recognition may not have the resources 
to create and/or update the lexicon of recognized words. 
Moreover, where the processor needs to handle other tasks 
(e.g., user interaction features) within the embedded system, 
conventional procedures for creating and/or updating the 
lexicon may not be able to execute within a reasonable 
length of time without adversely impacting the other sup- 
ported tasks. 

The present invention addresses the above problems 
through a distributed speech recognition architecture 
whereby words and their associated speech models may be 
added to a lexicon on a fully customized basis. In this way, 
the present invention achieves three desirable features: (1) 
the user of the consumer product can add words to the 
lexicon, (2) the consumer product does not need the 
resources required for creating new speech models, and (3) 
the consumer product is autonomous during speech recog- 
nition (as opposed to during speech reference training), such 
that it does not need to be connected to a remote server 
device. 

To do so, the speech recognition system includes a speech 
recognizer residing on a first computing device and a speech 
model server residing on a second computing device. The 
speech recognizer receives speech training data and pro- 
cesses it into an intermediate representation of the speech 
training data. The intermediate representation is then com- 
municated to the speech model server. The speech model 
server generates a speech reference model by using the 
intermediate representation of the speech training data and 
then communicates the speech reference model back to the 
first computing device for storage in a lexicon associated 
with the speech recognizer. 

For a more complete understanding of the invention, its 
objects and advantages, refer to the following specification
and to the accompanying drawings. 
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating a personal digital assistant 
in the context of a distributed speech recognition system in 
accordance with the present invention; and

FIG. 2 is a diagram illustrating a cellular telephone in the 
context of the distributed speech recognition system of the 
present invention. 

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

The techniques employed by the present invention can be
applied in a number of useful applications. For illustration
purposes, a preferred embodiment of the invention will first
be described within a personal digital assistant (PDA) application.
Following this description, another example of a
preferred embodiment will be presented in the context of a
cellular telephone application. Of course, it will be appreciated
that the principles of the invention can be employed
in a wide variety of other applications and consumer products
in which speech recognition is employed.

Referring to FIG. 1, a personal digital assistant device is
illustrated at 10. The device has a display screen 12 that
presents information to the user and on which the user can
enter information by writing on the display using a stylus 14.
The personal digital assistant 10 includes a handwriting
recognition module that analyzes the stroke data entered by
the user via the stylus. The handwriting recognition module
converts the handwritten stroke data into alphanumeric text
that may be stored in a suitable form (e.g., ASCII format)
within a portion of the random access memory contained
within the PDA 10.

In a typical PDA device, the operating system of the
device manages the non-volatile memory used for storing
data entered by the user. Although the precise configuration
and layout of this non-volatile memory is dependent on the
particular operating system employed, in general, a portion
of memory is allocated for storing alphanumeric data
entered by the user in connection with different applications.
These applications include address books, e-mail address
directories, telephone dialers, scheduling and calendar
applications, personal finance applications, Web-browsers
and the like. For illustration purposes, an address book
application 20 is illustrated in FIG. 1. When the user enters
names, addresses and phone numbers using the stylus, the
alphanumeric data corresponding to the user-entered information
is stored in a portion of the system's non-volatile
random access memory which has been designated as word
memory 21 in FIG. 1.

The PDA 10 of the present embodiment is a speech-enabled
device. It includes a microphone 16, preferably
housed within the device to allow the user to enter speech
commands and enter speech data as an alternative to using
the stylus. For instance, the user may speak the name of a
person whose address and telephone number they want to
retrieve from their address book. Preferably, the PDA 10 also
includes an integral speaker 18 through which digitally
recorded audio data and synthesized speech data can be
transmitted to the user.

Speech data entered through microphone 16 is processed
by a speech recognizer module 22 within the PDA 10. The
speech recognizer may be a stand alone application running
on the PDA device, or it may be incorporated into the
operating system of the PDA device. There are a variety of
different speech templates upon which speech recognizer 22
may be based. Hidden Markov Models (HMMs) are popular
today and may be used to implement the illustrated embodiment.
Alternatively, other templates can be employed, such
as a model based on high similarity regions as proposed by
Morin et al. in U.S. Pat. Nos. 5,684,925, 5,822,728 and
5,825,977 which are incorporated herein by reference.

Speech recognizer 22 works in conjunction with a locally
stored lexicon 24 of words that may be recognized by the
system. The lexicon 24 is arranged such that there is a
speech model associated with each word that is recognizable
by the system. This arrangement is illustrated in FIG. 1 by
a data structure that associates a unit of word data 26 with
a corresponding speech model 28. In this way, the speech
recognizer 22 retrieves the alphanumeric text for the word
that matches the input speech data. In the case of the address
book, the application 20 can retrieve the appropriate address
and telephone number using the alphanumeric text for the
spoken name as provided by the speech recognizer 22.
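
For illustration, the association between word data 26 and speech model 28, and the subsequent address-book retrieval, might be sketched as follows. The class layout, the tuple-of-numbers placeholder for a speech model, and `toy_score` are all assumptions; an actual recognizer would score input against HMMs or similar templates rather than compare raw numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Lexicon:
    """Associates word data (alphanumeric text) with a speech model."""
    models: dict = field(default_factory=dict)

    def add(self, word, speech_model):
        self.models[word] = tuple(speech_model)

    def recognize(self, observed, score):
        """Return the word whose stored model scores best, or None if empty."""
        if not self.models:
            return None
        return max(self.models, key=lambda w: score(observed, self.models[w]))


def toy_score(observed, model):
    """Hypothetical similarity: negative sum of absolute differences."""
    return -sum(abs(a - b) for a, b in zip(observed, model))


def look_up_contact(observed, lexicon, address_book):
    """Recognize the spoken name, then fetch its address-book record."""
    word = lexicon.recognize(observed, toy_score)
    return address_book.get(word)
```

The recognizer returns only the alphanumeric text; the address-book application then uses that text as its own lookup key, which keeps the lexicon independent of any particular application.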

The personal digital assistant 10 presents a challenge of
trying to achieve each of the previously described desirable
features. Thus, the PDA 10 employs a distributed speech
recognition architecture whereby words and their associated
speech models may be added to lexicon 24 on a fully
customized basis. Using the stylus or other suitable input
device, such as a keyboard, the user enters words into word
memory 21. The system then acquires speech reference
models corresponding to those words by accessing a second
computing device.

In the presently preferred embodiment, a reference model
server supplies the speech models for newly entered words.
The reference model server 40 may be implemented on a
suitable host server computer 42, typically at a remote site.
The PDA 10 and server computer 42 communicate with one
another by suitable communication modules 30 and 44. In
this regard, the communication modules can take many
forms in order to support popular communication hardware
and software platforms. For instance, the PDA 10 and server
computer 42 may be configured to communicate with one
another through an RS232 interface in which the PDA 10
plugs into a cradle connected by a cable to a serial port of
the server computer 42. The PDA 10 and host computer 42
may also communicate via a public telephone network or a
cellular telephone network using suitable modems.
Alternatively, the PDA 10 and host computer 42 may
communicate through an infrared link, Ethernet or other suitable
hardware platform using common communication protocols
(e.g., TCP/IP). In this way, the personal digital
assistant 10 and server computer 42 may be configured to
communicate with each other over the Internet.

The reference model server 40 preferably includes a
database of speaker independent models 46, comprising a
relatively extensive set of words and their associated speech
reference models. When the user enters a new word in the
PDA 10, the word is communicated via communication
modules 30 and 44 to the reference model server 40. If the
user-supplied word is found in the database 46, the speech
model corresponding to that word may be transferred to the
PDA through the communication modules. The PDA then
stores the newly acquired speech reference model in its
lexicon 24, such that the speech reference model is associated
with the user-supplied word as illustrated by data
structures 26 and 28.

In the event the user-supplied word is not found in the
database 46, the system will generate a speech reference
model for the word. To do so, the system employs a
phoneticizer 48 and a reference model training module 50.
First, the phoneticizer 48 parses the letters that make up the
word and then employs a decision tree network to generate
one or more hypothesized pronunciations (i.e., phonetic
transcriptions) of the user-entered word. This set of synthesized
pronunciations then serves as input to the reference
model training module 50, which in turn creates a new
speech reference model based on the speech model
template associated with the reference model training module 50.
In a preferred embodiment, a Hidden Markov Model
(HMM) is used as the speech model template for the training
module 50. The reference model training module 50 may
also employ a procedure for ascertaining the optimal speech
model for the phonetic transcription input.
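
The server-side choice between retrieving a stored model and generating a new one might look roughly like the sketch below. The decision tree phoneticizer and the HMM trainer are not reproduced; `toy_phoneticizer` and `toy_trainer` are hypothetical placeholders, and the dictionary-shaped model is an assumption made only so the control flow can be shown.

```python
def toy_phoneticizer(word):
    """Hypothetical stand-in: one 'pronunciation' made of the word's letters."""
    return [tuple(word.lower())]


def toy_trainer(pronunciations):
    """Hypothetical stand-in for the reference model training module."""
    return {"pronunciations": pronunciations}


class ReferenceModelServer:
    def __init__(self, database):
        # database maps words to speaker-independent reference models
        self.database = dict(database)

    def model_for(self, word):
        """Return the stored model if present; otherwise generate one."""
        model = self.database.get(word)
        if model is None:
            model = toy_trainer(toy_phoneticizer(word))
        return model
```

In this sketch a generated model is returned to the caller without being written back to the database, since the description does not say whether the server retains it.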

Alternatively, if the user-entered word is not found in the
database 46, the system may generate a speech reference
model based on speech training data that corresponds to the
user-supplied word. In this case, the user speaks the word for
which the new speech reference model is desired. The
system receives the user-supplied word as audio data via the
microphone 16. Speech recognizer 22 converts the audio
data into a digitized input signal and then into a parameterized
intermediate form. In a preferred embodiment of the
present invention, the intermediate representation of the
word is a vector of parameters representing the short term
speech spectral shape of the audio data. The vector of
parameters may be further defined as, but not limited to,
pulse code modulation (PCM), μ-law encoded PCM, filter
bank energies, line spectral frequencies, linear predictive
coding (LPC) cepstral coefficients or other types of cepstral
coefficients. One skilled in the art will readily recognize that
the system may prompt the user for one or more utterances
of the user-supplied word in order to provide ample speech
training data. In this case, the intermediate representation of
the word is comprised of a sequence of vectors having one
sequence for each training repetition. When the word is not
found in the lexicon, the intermediate form is then communicated
via communication modules 30 and 44 to the reference
model server 40.
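
As one concrete example of the listed intermediate forms, μ-law companding of a normalized sample can be sketched as below. The description names μ-law encoded PCM only as an option; the continuous companding curve and the value μ = 255 used here come from standard telephony practice rather than from this specification, and practical codecs use a segmented 8-bit approximation of this curve.

```python
import math

MU = 255.0  # conventional value in North American and Japanese telephony


def mu_law_compress(x, mu=MU):
    """Compress a sample in [-1, 1] with the continuous mu-law curve."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("sample must be normalized to [-1, 1]")
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Invert mu_law_compress for a compressed value in [-1, 1]."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
```

Compression gives small amplitudes proportionally more of the output range, which is why the form suits low-bit-rate transmission of speech from a small device.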

The reference model server 40 passes the intermediate 
representation of the word to the reference model training 
module 50, where a speech model is constructed using the 
speech model template. To construct a speech model, the 
reference model training module 50 may decode the time
series of parameter vectors in the speech training data by 
comparison to a set of phonetic Hidden Markov Models, 
thereby obtaining a phonetic transcription of the utterance in 
the speech training data. In this case, the transcription serves 
as the speech reference model. Alternatively, the reference 
model training module 50 may align the time series of 
parameter vectors for each repetition of the speech utterance 
in the speech training data as is well known in the art. In this 
case, the reference model training module 50 computes the 
mean and variance of each parameter at each time interval 
and then constructs the speech reference model from these 
means and variances (or functions of these means and 
variances). In either case, the newly constructed speech 
reference model is thereafter sent back over the communi- 
cation link to the PDA. Finally, the new speech reference 
model along with the alphanumeric representation of the 
user-supplied word is added to lexicon 24. 
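
The second training option, computing a mean and variance for each parameter at each time interval, might be sketched as follows. The sketch assumes that the repetitions have already been time-aligned to the same number of frames (the alignment step itself is omitted) and uses the population variance; neither choice is stated in the description.

```python
from statistics import fmean, pvariance


def mean_variance_model(repetitions):
    """Build per-frame (means, variances) from aligned repetitions.

    repetitions is a list of utterances; each utterance is a list of frames;
    each frame is a list of parameter values.
    """
    n_frames = len(repetitions[0])
    if any(len(rep) != n_frames for rep in repetitions):
        raise ValueError("repetitions must be aligned to the same length")
    model = []
    for t in range(n_frames):
        frames = [rep[t] for rep in repetitions]
        columns = list(zip(*frames))  # one column per parameter
        means = [fmean(col) for col in columns]
        variances = [pvariance(col) for col in columns]
        model.append((means, variances))
    return model
```

With two repetitions of two frames each, the result holds two (means, variances) pairs, one per time interval.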

A second preferred embodiment of the present invention 
will be described in relation to a cellular telephone appli- 
cation as shown in FIG. 2. The cellular telephone handset 
device 60 contains an embedded microphone 62 for receiv- 
ing audio data from the user and an embedded speaker 64 for 
transmitting audio data back to the user. The handset device 
60 also includes a telephone keypad 66 for dialing or for 
entering other information, as well as a small liquid crystal 
display screen 68 that presents information to the user. Thus,
the cellular telephone lends itself to different types of 
embedded speech-enabled applications. 

Although various types of speech-enabled applications 
are envisioned, an automatic voice dialing feature is illus- 
trated in FIG. 2. To voice dial the telephone, a user merely 
speaks the name of the person they wish to call. The audio 
data corresponding to the spoken name is then processed by 
a speech recognizer module 22' within the handset device 
60. The speech recognizer 22' works in conjunction with a 
locally stored lexicon 24' of words that may be recognized 
by the system. As shown in FIG. 2, the lexicon 24' is 
arranged according to a data structure that associates each 
recognizable word with a corresponding speech reference 
model. 

If the name is recognized by the speech recognizer 22', the 
alphanumeric representation of the spoken word is passed 
along to an automatic dialer module 70. A portion of the 
system's non-volatile random access memory is used to 
maintain a mapping between names and telephone numbers. 
The automatic dialer module 70 accesses this memory space 
to retrieve the telephone number that corresponds to the 
alphanumeric representation of the spoken name and then 
proceeds to dial the telephone number. In this way, the user 
is able to automatically voice dial the cellular telephone. 

The cellular telephone also presents a challenge of trying 
to achieve each of the previously identified desirable fea- 
tures. Again, the cellular telephone employs a distributed 
speech recognition architecture whereby words and their 
associated speech models may be added to lexicon 24' on a 
fully customized basis. When the user-supplied name is not 
found in the lexicon 24', the user may enter the name by 
using either the keypad 66 or some other suitable input 
device. The alphanumeric data corresponding to the name is 
stored in a portion of the system's non-volatile random 
access memory which has been designated as word memory 
21'. The name is then communicated via communication 
modules 30' and 44' to the reference model server 40'.

As previously described, the reference model server 40'
passes the intermediate representation of the name to the
reference model training module 50', where a speech model
is constructed using the speech model template. Thereafter, 
the newly constructed speech reference model is sent back 
over the communication link to the telephone handset device 
60. Finally, the speech reference model along with a corre- 
sponding user-supplied word is added to lexicon 24' of the 
telephone handset device 60. 

For an automatic voice dialing application, it is envi- 
sioned that the lexicon 24' may also be configured to 
associate telephone numbers, rather than names, with a 
speech reference model. When the user speaks the name of 
the person they wish to call, the speech recognizer 22' works 
in conjunction with the lexicon 24' to retrieve the telephone 
number that corresponds to the spoken name. The telephone 
number is then directly passed along to the automatic dialer 
module 70. 

The foregoing discloses and describes merely exemplary
embodiments of the present invention. One skilled in the art 
will readily recognize from such discussion, and from 
accompanying drawings and claims, that various changes, 
modifications, and variations can be made therein without 
departing from the spirit and scope of the present invention. 

What is claimed is: 

1. A speech processing system for constructing speech 
recognition reference models, comprising: 

a speech recognizer residing on a first computing device; 
said speech recognizer receiving speech training data and
processing the speech training data into an intermediate 
representation of the speech training data, said speech 
recognizer further being operative to communicate the 
intermediate representation to a second computing 
device; 

a speech model server residing on said second computing 
device, said second computing device being intercon- 
nected via a network to said first computing device; 

said speech model server receiving the intermediate rep- 
resentation of the speech training data and generating a 
speech reference model using the intermediate 
representation, said speech model server further being 
operative to communicate the speech reference model 
to said first computing device; and 

a lexicon coupled to said speech recognizer for storing the 
speech reference model on said first computing device. 

2. The speech processing system of claim 1 wherein said 
speech recognizer receives alphanumeric text that serves as 
the speech training data and said intermediate representation 
of the speech training data being a sequence of symbols from 
said alphanumeric text. 

3. The speech processing system of claim 1 wherein said 
speech recognizer captures audio data that serves as the 
speech training data and digitizes the audio data into said 
intermediate representation of the speech training data. 

4. The speech processing system of claim 1 wherein said 
speech recognizer captures audio data that serves as the 
speech training data and converts the audio data into a vector 
of parameters that serves as said intermediate representation 
of the speech data, where the parameters are indicative of the 
short term speech spectral shape of said audio data. 

5. The speech processing system of claim 4 wherein said 
vector of parameters is further defined as either pulse code 
modulation (PCM), μ-law encoded PCM, filter bank 
energies, line spectral frequencies, or cepstral coefficients. 

6. The speech processing system of claim 1 wherein said 
speech model server further comprises a speech model 
database for storing speaker-independent speech reference 
models, said speech model server being operative to retrieve 
a speech reference model from said speech model database 
that corresponds to the intermediate representation of said 
speech training data received from said speech recognizer. 

7. The speech processing system of claim 1 wherein said 
speech model server further comprises: 

a phoneticizer receptive of the intermediate representation 
for producing a plurality of phonetic transcriptions; and 

a model trainer coupled to said phoneticizer for building 
said speech reference model based on said plurality of 
phonetic transcriptions. 

8. The speech processing system of claim 4 wherein said 
speech model server further comprises: 

a Hidden Markov Model (HMM) database for storing 
phone model speech data corresponding to a plurality 
of phonemes; and 

a model trainer coupled to said HMM database for 
decoding the vector of parameters into a phonetic 
transcription of the audio data, whereby said phonetic 
transcription serves as said speech reference model. 

9. The speech processing system of claim 1 wherein said 
speech recognizer captures at least two training repetitions 
of audio data that serves as the speech training data and 
converts the audio data into a sequence of vectors of 
parameters that serves as said intermediate representation of 
the speech training data, where each vector corresponds to 
a training repetition and the parameters are indicative of the 
short term speech spectral shape of said audio data. 

10. The speech processing system of claim 9 wherein said 
speech model server being operative to determine a reference 
vector from the sequence of vectors, align each vector 
in the sequence of vectors to the reference vector, determine 
a mean and a variance of each parameter in the reference 
vector computed over the values in the aligned vectors, 
thereby constructing said speech reference model from the 
sequence of vectors. 

11. A distributed speech processing system for supporting 
applications that reside on a personal digital assistant (PDA) 
device, comprising: 

an input means for capturing speech training data at the 
PDA; 

a speech recognizer coupled to said input means and 
receptive of speech training data from said input 
means; 

said speech recognizer being operative to process the 
speech training data into an intermediate representation 
of the speech training data and communicate the 
intermediate representation to a second computing device; 

a speech model server residing on said second computing 
device, said second computing device being 
interconnected via a network to the PDA; 

said speech model server receiving the intermediate 
representation of the speech training data and generating a 
speech reference model using the intermediate 
representation, said speech model server further being 
operative to communicate the speech reference model 
to said first computing device; and 

a lexicon coupled to said speech recognizer for storing the 
speech reference model on the PDA. 

12. The distributed speech processing system of claim 11 
wherein said input means is further defined as: 

a stylus; 

a display pad for capturing handwritten stroke data from 
the stylus; and 

a handwritten recognition module for converting 
handwritten stroke data into alphanumeric data, whereby the 
alphanumeric data serves as speech training data. 

13. The distributed speech processing system of claim 12 
wherein said speech recognizer segments the alphanumeric 
data into a sequence of symbols which serves as the 
intermediate representation of the speech training data. 

14. The distributed speech processing system of claim 11 
wherein said speech model server further comprises a 
speech model database for storing speaker-independent 
speech reference models, said speech model server being 
operative to retrieve a speech reference model from said 
speech model database that corresponds to the intermediate 
representation of said speech training data received from 
said speech recognizer. 

15. The distributed speech processing system of claim 11 
wherein said speech model server further comprises: 

a phoneticizer receptive of the intermediate representation 
for producing a plurality of phonetic transcriptions; and 

a model trainer coupled to said phoneticizer for building 
said speech reference model based on said plurality of 
phonetic transcriptions. 

16. The distributed speech processing system of claim 11 
wherein said input means is further defined as a microphone 
for capturing audio data that serves as speech training data. 

17. The distributed speech processing system of claim 16 
wherein said speech recognizer converts the audio data into 
a digital input signal and translates the digital input signal 
into a vector of parameters which serves as the intermediate 
representation of the speech training data, said parameters 
being indicative of the short term speech spectral shape of 
said audio data. 

18. The distributed speech processing system of claim 17 
wherein said vector of parameters is further defined as either 
pulse code modulation (PCM), μ-law encoded PCM, filter 
bank energies, line spectral frequencies, or cepstral 
coefficients. 

19. The distributed speech processing system of claim 11 
wherein said speech model server further comprises: 

a Hidden Markov Model (HMM) database for storing 
phone model speech data corresponding to a plurality 
of phonemes; and 

a model trainer coupled to said HMM database for 
decoding said vector of parameters into a phonetic 
transcription of the audio data, whereby said phonetic 
transcription serves as said speech reference model. 

20. The speech processing system of claim 11 wherein 
said speech recognizer captures at least two training 
repetitions of audio data that serves as the speech training data and 
converts the audio data into a sequence of vectors of 
parameters that serves as said intermediate representation of 
the speech training data, where each vector corresponds to 
a training repetition and the parameters are indicative of the 
short term speech spectral shape of said audio data. 

21. The speech processing system of claim 20 wherein 
said speech model server being operative to determine a 
reference vector from the sequence of vectors, align each 
vector in the sequence of vectors to the reference vector, 
determine a mean and a variance of each parameter in the 
reference vector computed over the values in the aligned 
vectors, thereby constructing said speech reference model 
from the sequence of vectors. 

22. A distributed speech processing system for supporting 
applications that reside on a cellular telephone handset 
device, comprising: 

an input means for capturing speech training data at the 
handset device; 

a speech recognizer coupled to said input means and 
receptive of speech training data from said input 
means; 

said speech recognizer being operative to process the 
speech training data into an intermediate representation 
of the speech training data and communicate the 
intermediate representation to a second computing device; 

a speech model server residing on said second computing 
device, said second computing device being 
interconnected via a network to the handset device; 

said speech model server receiving the intermediate 
representation of the speech training data and generating a 
speech reference model using the intermediate 
representation, said speech model server further being 
operative to communicate the speech reference model 
to said first computing device; and 

a lexicon coupled to said speech recognizer for storing the 
speech reference model on the handset device. 

23. The distributed speech processing system of claim 22 
wherein said input means is further defined as a keypad for 
capturing alphanumeric data that serves as speech training 
data, such that said speech recognizer segments the 
alphanumeric data into a sequence of symbols which serves as the 
intermediate representation of the speech training data. 

24. The distributed speech processing system of claim 22 
wherein said reference model server further comprises a 
speech model database for storing speaker-independent 
speech reference models, said reference model server being 
operative to retrieve a speech reference model from said 
speech model database that corresponds to the intermediate 
representation of said speech training data received from 
said speech recognizer. 

25. The distributed speech processing system of claim 22 
wherein said speech model server further comprises: 

a phoneticizer receptive of the intermediate representation 
for producing a plurality of phonetic transcriptions; and 

a model trainer coupled to said phoneticizer for building 
said speech reference model based on said plurality of 
phonetic transcriptions. 

26. The distributed speech processing system of claim 22 
wherein said input means is further defined as a microphone 
for capturing audio data that serves as speech training data. 

27. The distributed speech processing system of claim 26 
wherein said speech recognizer converts the audio data into 
a digital input signal and translates the digital input signal 
into a vector of parameters which serves as the intermediate 
representation of the speech training data, said parameters 
being indicative of the short term speech spectral shape of 
said audio data. 

28. The distributed speech processing system of claim 27 
wherein said vector of parameters is further defined as either 
pulse code modulation (PCM), μ-law encoded PCM, filter 
bank energies, line spectral frequencies, or cepstral coeffi- 
cients. 

29. The distributed speech processing system of claim 22 
wherein said speech model server further comprises: 

a Hidden Markov Model (HMM) database for storing 
phone model speech data corresponding to a plurality 
of phonemes; and 

a model trainer coupled to said HMM database for 
decoding said vector of parameters into a phonetic 
transcription of the audio data, whereby said phonetic 
transcription serves as said speech reference model. 


30. The distributed speech processing system of claim 22 
wherein said speech recognizer captures at least two training 
repetitions of audio data that serves as the speech training 
data and converts the audio data into a sequence of vectors 
of parameters that serves as said intermediate representation 
of the speech training data, where each vector corresponds 
to a training repetition and the parameters are indicative of 
the short term speech spectral shape of said audio data. 

31. The distributed speech processing system of claim 30 
wherein said speech model server being operative 
to determine a reference vector from the sequence of 
vectors, align each vector in the sequence of vectors to the 
reference vector, determine a mean and a variance of each 
parameter in the reference vector computed over the values 
in the aligned vectors, thereby constructing said speech 
reference model from the sequence of vectors. 

32. A method of building speech reference models for use 
in a speech recognition system, comprising the steps of: 

collecting speech training data at a first computing device; 
processing the speech training data into an intermediate 
representation of the speech training data on said first 
computing device; 
communicating said intermediate representation of the 
speech training data to a second computing device, said 
second computing device interconnected via a network 
to said first computing device; 
creating a speech reference model from said intermediate 
representation at said second computing device; and 
communicating said speech reference model to the first 
computing device for use in the speech recognition 
system. 
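
The model-construction step recited in claims 10, 21, and 31 (pick a reference vector, align each vector to it, then take a per-parameter mean and variance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each training repetition is taken to be a non-empty list of floats, the median-length repetition is assumed as the reference, and a simple linear warp stands in for whatever alignment procedure an actual system would use. All function names are hypothetical.

```python
from statistics import fmean, pvariance


def pick_reference(repetitions):
    """Assumed heuristic: use the repetition of median length as the reference."""
    ordered = sorted(repetitions, key=len)
    return ordered[len(ordered) // 2]


def align_to(reference, vector):
    """Linearly warp `vector` onto the reference length.

    This is a simplified stand-in for a real alignment procedure.
    """
    n, m = len(reference), len(vector)
    if n == 1:
        return [vector[0]]
    return [vector[round(i * (m - 1) / (n - 1))] for i in range(n)]


def build_reference_model(repetitions):
    """Per-parameter mean and variance over the aligned training repetitions."""
    if len(repetitions) < 2:
        raise ValueError("need at least two training repetitions")
    reference = pick_reference(repetitions)
    aligned = [align_to(reference, v) for v in repetitions]
    columns = list(zip(*aligned))  # one column per reference parameter
    return {
        "mean": [fmean(col) for col in columns],
        "variance": [pvariance(col) for col in columns],
    }


model = build_reference_model([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
print(model)
```

The population variance (`pvariance`) is used here so that two repetitions already give a defined value; a sample variance would be an equally plausible choice.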

CERTIFICATE OF CORRECTION 


PATENT NO. : 6,463,413 B1 Page 1 of 1 

DATED : October 8, 2002 

INVENTOR(S) : Ted H. Applebaum and Jean-Claude Junqua 


It is certified that error appears in the above-identified patent and that said Letters Patent is 
hereby corrected as shown below: 


Title page. 

Item [73], Assignee, "Matsushita Electrical Industrial Co., Ltd." should be 
-- Matsushita Electric Industrial Co., Ltd. -- 

Item [56], References Cited, U.S. PATENT DOCUMENTS, the following 
references should be added: 

-- 5,054,082 10/1991 Smith, et al. 
5,212,730 05/1993 Wheatley, et al. 
5,732,187 03/1998 Scruggs, et al. -- 

Column 10, 

Line 10, "operative operative" should be -- operative --. 